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Abstract 

In this work, a new data-driven shrinkage estimator for the population precision 
matrix has been introduced using random matrix theory. The new estimation is 
non-parametric without assuming a specific parameter distribution for the data and 
also there is no prior information about the structure of the population covariance 
matrix. We demonstrate by both theoretical and empirical studies that the new 
estimator, which is applicable for p > n, has good properties for a wide range of 
dimensions and sample sizes. Moreover, even if p < n, our new method always 
dominates the inverse sample covariance matrix and performs comparably with 
existing methods. 

Key words and phrases: Random matrix theory, shrinkage estimation, large 
dimensional data, precision matrix. 



1 Introduction 

Suppose observations X] , • • • ,X n indepen dently from a multivariate model (See, 
iBai and S aranadasal l 1996 or lChen et a/.l l2010) 



X i =l} p /2 Y i + n <) , i= l,...,n (1) 

where /io is a ^-dimensional constant vector and L p is a p x p positive definite matrix 
which acts as the population covariance matrix. Here Y = (Y\ , ■ ■ ■ ,Y n ) = (Yij) pxn and 
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{Yij,i,j — 1,2,- ••} are real independent and identically distributed (i.i.d.) random 
variables with common mean zero and unit variance. 

In multivariate analysis, estimation of the covariance matrix L p and precision ma- 
trix Q. p := T,p l is an important problem. Given the samples {X\, ■ ■ ■ ,X n }, the usual 
estimation of L p is the sample covariance matrix which is defined as 



1 



I Eft- 



-X)(Xj-X? 



(2) 



where X = i L" =1 X, and the superscript r denotes the transpose of a matrix or vector. 
Naturally, in many areas of statistical analysis, S~ 1 acts as the co mmon estimator o f 
D. p ; among the practical exam ples are linear discriminant analysis dAndersonl l2003h . 
T 2 statistic jHotellingl Il93lb . Markowitz mean-variance analysis (IMarkowitzi 1 19521) 
and so on. 

In classic statistics where p is fixed and n — > °°, S^ 1 is a good estimator for Q. p . 
In the large dimensional data setting where the data dimension p is large compared to 
the sample size n, the usual estimator which simply takes the inverse of the sample 
covariance matrix has two disadvantages. Firstly, S„ is singular if p > n which means 
we can not get a stable estimator for Q. p . Secondly, even if p < n, S^ 1 as the estimator 
of Q.„ is known to p erform poorly. For example, if p/n—t y S (0, 1), by Remark 2 of 



Pan an d Zhou ( 201 II) . we have 



y{i+y-y 2 ) 
(i-y) 3 : 



(3) 



which shows that the estimation error is dramatically large, especially when y—tl. 

To estimate the precision matrix £2 p , assuming certain structures such as sparseness 
or ordering structures, methods based on penalties or thresholds have been proposed 
and s tudied (see, e.g. iFriedman et aZl l2008l pY\ianll2010L IShao et al\ 201 1, Ca i et al.\ 
201 lb . In the absence of prior information about the covariance matrix, shrinkage 



based methods have been proposed in many wo rks to improve standard estimators 
since t he seminal paper of IJames and Steinl dl96ll) . About population covariance ma- 
trices, iFriedmanl (11989b firstly co nsidered shrinkage-base d estimations and applied 
them to discriminant analysis. In iLedoit and Wolfj (12004b . authors studied the sam- 



ple covariance matrix S„ with respect to a quadratic loss function and proposed a 
well-conditio ned estimator wh i ch is a linear combination o f S n and identity matrix 



Following ILedoit and Wolf (2004), Fisher and Sunl ( 201 1 ) con centrated on sh rink 



ing the sample covariance matrix to different target matrices and I Wartonl (12008b pro- 
posed a penalized likel ihood approach to estimate shrinkage parameters. Recently, 
Ledoit and Wolf (1201 2b conducted an estimator for XL by non-linear shrinking the 
eigenvalues of the sample covariance matrix. Compared with E /7 , estimation of the 
precision matrix Cl p is more inv olved due to no expli cit formulas for the common 
Frobenius-typed loss functions. In Mestre and Lagunas ( 2006b . authors recommended 
using (1 — p/n)S„ to estimate Q. which is a rescaling of the inverse of the sample co- 
variance matrix and ILedoit and Wolf] (1201 2b proposed an non-linear shrinkage method 
for Q. n . 
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Although the shrinking estimators proposed by ILedoit and Wo lf (2004) and suc- 
cessive works are invertible and more accurate than sample covariance matrix S n to 
estimate L p , their inverses are usually not the best one for Q. a m ong the combinations 



of S n a nd I p . Moreover, the methods in ILedoit and W olf (2012) or Mestre and Lagunas 
(2006) are only applicable for p < n. In this work, motivated by ILedoit and Pecha 



1 /2 

d201 ll) . we will study the asymptotic properties of the matrix T, p (S n + A/ p ) _1 Z 
and its relation with (S„ + Xl p )~ . Based on these limiting results, we propose an 
optimal linear combination of S„ and I p under the loss function 

-tr{Z p (X 1 S n + folp)- 1 -I p ) 2 , 

The new estimation is non-parametric without assuming a specific parameter distri- 
bution for the data and also there is no prior information about the structure of the 
population covariance matrix. The new estimator has no restriction on p < n and is 
applicable for p >n. Even if p < n, the new estimator will al ways d ominate the stan- 
dard S„ an d (1 ~~ p/ n )Sn l proposed by iMestre and La gunas (2006). It also p erforms 
comparably with the non-linear shrinkage estimator in ILedoit and Wolfl (120121) . 

The rest of the article is organized as follows. Section 2 introduces the asymp- 
totic properties about the sample covariance matrix. Section 3 develops a data-driven 
shrinkage estimator. In Section 4, we conduct simulations in simulated data and real 
data to evaluate the proposed optimal shrinkage estimator and compare them with exist- 
ing shrinkage methods. Conclusion are presented in Section 5. The Appendix contains 
the proofs of the technical results. 
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2 Asymptotic Properties 

Firstly, we need some notations in RMT Supposing A„ is an n x n Hermitian matrix 
with eigenvalues A,/, j = 1 , • • • , n, we can define a one-dimensional distribution function 

F A »(x)= ) -£l(lj<x) 

called the empirical spectral distribution (ESD) of the matrix A„, where /(•) denotes the 
indicator function. The limit distribution of F A " if it exists and which is non-random is 
called the limiting spectral distribution (LSD) of the sequence {A„}. The importance 
of ESD is due to the fact that many important statistics in multivariate analysis can 
be expressed as functionals of the ESD (e.g., det(A n ) = exp(n J lnxd F A "^), tr(A„) = 
n [ xdF A " W ; etc). More details can be found in a recent monograph by Bai and Silversteinl 
(2010). 



In RMT, we usually study the limit of Stieltjes transform of F which is given by: 

m F (z) = [ —dF(t), z E C+ = {z € C : Imz > 0}. (4) 
J t—z 

Because of the inversion formula 

F{[c,d]}= lim - [ d lmjn F (^+iri)d^, (5) 

7)^0+ % Jc 
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where c < d are continuity points of F, F is uniquely determined by its Stieltjes trans- 
form. 

In this work, we make the following assumptions 

Al. p,n — > °° such that p/n—tyE (0,°°) and the fourth moment of Yij is bounded; 

A2. The extreme eigenvalues of L p is uniformly bounded that is there are constants 
c\ ,C2 satisfying c\ < X m j n (L p ) < X„ wx (Ep) < C2 and F z '> tends to a non-random 
probability distribution H. 

Now, we can introduce a lemma about the LSD of E p ' (S„ + Xl p )T. p ' . 

Theorem 1 Under the conditions of Al and A2, as n — > °° F z p 1 ( s "+' l/ p) r P / con- 
verges a.s. to a non-random distribution F , whose Stieltjes transform m{z) satisfies 

= / 1 ( 6 ) 



t Z_r \+ym(z) 



where A > and z £ 



Remar k 1 The result ofTheorem\l\also can be derivedfrom Theorem 1.2 i mLedoit and Peche 
( 120771) where the 12 th moment is needed. 

By Silversteinl (1995), we know the Stieltjes transform mo(z) of LSD of S n is the 
solution of the following equation 

■»&>=/ ,,, " H( " M1 C7) 



More analytic behaviors of Equation (0 are referred to ISilverstein and Choil d 1995b . 
Now, we can study the relationships between l}/ 2 (S n + A7 p )~ 1 £ ; 1 / 2 and (S„ + A7 p ) _1 . 

Theorem 2 When A > 0, under the conditions of Al and A2, as n — > °° almost surely 
ifr(Ey 2 (5„ + A7 p )- 1 E ; 1 , /2 )^ J R 1 (A), 
l -tr{Y}l\s n +Xl p )- x l}J 2 ) 2 ^R 2 {X), 



and 



What's more, 



«i(A) = 

*2(A) : 



-tr{(S n + Xl p )- l )^m {-X). 
P 



1 — Aw()(— A) 



l-y(l-Am (-A))' 

1— Amo(— A) Amo(— A) — A 2 mQ(— A) 

(1 -Am (-A)))3 " (1 - Am (-A))) 4 ' 

\-2\ s. „J( 1 \ tfai(z) 



7/ere, almost surely, jtr((S n + Xl p ) ) — >m'(— A) = | 7= _a- 
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In applications, we cannot derive the statistics j ? tr(T.]/ 2 (S n + Xl p )~ 1 'Lp 2 y , j = 
1,2 directly because only S„ is known. Theorem [2] provid ed a theore t ical m ethod to 
estimate the statistics from the sample co variance matrix. In lChen et al\\2Q\ lb . authors 
derived a similar result to that of Theorem [2] under Gaussian assumptions and here no 
distribution assumptions were needed. About mo (—A) and equation ©, we have the 
following result. 

Lemma 1 In Theorem^ mo(— A) is the unique solution of the equation 

dH(t) 



(l-y+yAm(z))+A' 



(8) 



where 1 — y+yXm(— X) > 0. 



Remark 2 In LemmaU] the condition 1 — y + yXm(— X ) > is a variant of the condition 
under formula (1.4) in \Silverstein and Choi \l995 ) and is necessary. Here we use an 



example to illustrate this point. Assuming T. p — I p , two solutions of the Equation d5]l 
can be written out as follows 

m (1) (-A) = ^(-(1 -y + X) + ^/(l-y + A)2 + 4yA), 

m (2) (-A) - 2T(-(1 -y + A) - ^(l-.v + A)2 + 4yA). 

Direct clacul ations can show 1— y+yXm^>(— A) > and 1 — y+yXm( 2 \— A) < and 
this is why in Chen et all 1 201 1\ ), authors claimed mo(— A) = m^{— A) not mo(— A) = 
m< 2 )(-A). 



3 Optimal Estimator 

To estimate £l p = £r , we consider a class of estimators such as Q p = a(5„ 
j3lp)~ l . By Theorem[2] with probability 1, 



tr(L p *£l-I p ) 2 -» a 2 /? 2 (j3)-2a/^i(j3) + l 



- /?2( ^ )(0£ -%M ) + 1 ~^T (9) 

Therefore, to minimize the loss function ([9]), a should satisfy a — gjol an d the corre- 
sponding loss is 

Intuitively, j3 should minimize L(j3). About the optimal loss Lq = minp >0 L(l5), we 
have the following results. 
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Theorem 3 When y < 1, writing 



t ,,, i ..,_ (/ _ / „ M „_ s __ j) , 7S „, 



we have Lq = min.y > oL/f (y). What is more, 

7 *p = o%, 



I. When T. p — (T 2 I p , which means H(x) is a degenerate distribution at a 1 , the optimal 



loss is Lq = and i2* = (J~ 2 I p . 
II. For general distribution H(x), Lh{j) achieves its global minimum values Lq at y* 
satisfying 

Mf)Mf)- fiif) >fi(f) ^ 



/ 2 (r)/2(r)(/i(r)-/ 2 (r)) 

~* p 



where ft{x) — J (j^) k dH(t). Moreover, j3* satisfies the equation y* = 



anda* = ^f\. 

For y > 1, we can also get a similar result from the proofs of Theorem[3]and therefore 
will not pursue it due to limited space. The following corollary gives a data-driven 
method to estimate the best j3 . 



Corollary 1 Under the assumptions of Theorem\2\and writing y — p/n, «i(A) = 1 
\tr{{S n +I„)-\ a 2 {X) = Ur{{S n +I p )- 1 - ^tr^Sn+I,,)- 2 and 



tfl(A) 
Ri (A) 

almost surely, as n — > °° 



1 — 5>a i (A ) ' 

fli(A) 02(A) 



(l-y fll (A)) 3 (l-y«i(A)) 4 ' 



/?!(A)^/?i(A), 
k 2 {X)^R 2 {X). 



By Corollary Q] and continuous mapping theorem, with probability 1, 

KP > R 2 (P) KIJ) 

Therefore, for a real data or a sample covariance matrix S„, we can use a numerical 
algorithm such as Newton-Raphson method to find the optimal j3* fromL(/3). 

Compared with existing methods, the new estimator Q. * = a* (S n + /3 * / D ) has the 
following properties. Firstly, when y < 1, by Remark 2 of IPan and Zhoul d201 ll) 

ltr{{l-y)Y.S- l -I p fA-^, (12) 

p i-y 
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which is the loss of estimator in Mestre and Lagunasl d2006l) . By proofs of Theorem[3j 
we know the optimal loss of our estimator is Lq < Lh (0) = y. Together with formula 
(O, it is shown that when y < 1, th e new estimator will alway s dominate the standard 
5,7 1 and (1 —p/n)S~ l proposed by Mestre and Lagunasl d2006l) . Secondly, wheny > 1, 
we can still use the new estimator while the estimators based on S„ or the no n-linear 
shrinkage estimator based on the eigenvalues of S n in iLedoit and Woln d2012l) are not 
applicable any more. 



4 Simulations 

In real data such as gene data, samples usually come from different scales which 
means that the sample variances of the genes range over a wide interval. So, it is neces- 
sary to apply a normalization procedure to the data which adjusts valu es measured on 
differ ent scales to a notionally common scale (See, iDudoit et al. 2002 or lFan and Fan 
2008 ). Inspired bv lFisher and Sun ( 201 1 ), we propose a two-stage procedure to further 
improve the numerical performance of our new methods. Firstly, we normalize the data 
to eliminate the effect of different scales, that is, we will consider the sample correla- 
tion matrix S n . Based on S„, we propose a shrinkage estimation R p = a(S„ + /3/ p ) _1 
for the true inverse correlation matrix S" 1 . Secondly, we use diag(S„) to estimate 
the sample variance of each component. Combining these two estimators, a shrinkage 
estimation £2* for precision matrix £2 p is derived. 

Now, we turn to the numerical performance of our methods. For comparison pur- 
poses, we also conducted the following shrinkage methods: 

• The inverse of the sample covariance matrix S~ , (Standard estimator); 

• The estimator (1 — p/r^S^ 1 of Mestre and Lagunasl ( 2006 ) (M-L estimator); 

• The non-linear shrinkage estimator of ILedoit and Woli ( 2012 ) (L-W estimator); 



• The estimator for £ p where T 



diag(S„) of Fis her and Sunl d201 ll) (F-S estima 



tor); 

It is noted here that F-S estimator is not for Q. p and we will only conduct it for com- 
parison when p >n and where M-L estimator and L-W estimator are not applicable. 
Moreover, these estimators were designed under other criterion than the current loss 
function and they were conducted here only for comparison purposes. In the simulated 
studies, to mimic true population covariance matrices, three models will be considered 
as follows: 



• Model 1. Ei is diagonal with 20% of population eigenvalues are equal to 1, 40% 
are equal to 3 and 40% are equal to 10; 

• Model 2. £2 = £}^ 2 £o£[^ 2 where £0 = (<?ij)pxp an d Gij — 0.6''~- / ' for 1 < i,j < 

p; 

1/2 1 /2 

• Model 3. £3 = £/ £oo£i where £00 = [Gijjpxp an d Oiy = 0.5 for i ^ j. 
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Figure 1 : Empirical risk of the proposed estimator £2* and existing methods on Model 
1. 



Model 1 is a diagonal spiked example w hich was well studied in RMT (See. lBai and Silverstein 
19981 iBai et a/.H2010l lRao et aZ.l l2008). Another reason for conducting Model 1 is 



due to no open code f or L-W estimator and we will repeat part of simulations as 



Ledoit and W olf (2012) to compare our method with L-W estimator. Model 2 is an 



example of a sparse matrix where the values of the entries decay as they move away 
from the diagonal. Model 3 serves as a dense matrix example. In simulations, without 
loss of generality, we assume jUo = 0. About the random part, as the simulation results 
for other distributions followed very similar patterns to those of Gaussian data, we will 
only report the simulation results for the Gaussian data, that is Yij,i,j — 1,2, • • • are 
i.i.d from standard normal distributions. All the simulations results are based on 10 3 
replications. 

4.1 Influence of convergence 

Since the proposed estimator is derived from an asymptotic optimal loss function, 
the first set of simulations will show how the new shrinkage estimator £2* behaves as 
the matrix dimension p and the sample size n go to infinity together. We a ssume that 



the ratio p/n remains constant and equal to 1/3 as iLedoit and Woln d2012l) did. The 



result for Model 1 is plotted in Figure [T] and Figure [2] are empirical performances for 
Model 2 and Model 3. 

From Figure Q] and [2] we can see that the performance of the proposed shrinkage 
estimator £2* converges quickly toward to the optimal risk Lq which was calculated by 
Theorem[3] Even for the relatively small sample dimension, the proposed £2* behaves 
better than 5,7 1 or M-L estimator. These simulations demonstrate that the proposed 
estimator dominates the classic 5,7 1 and M-L estimator. 
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Figure 2: Empirical risk of the proposed estimator £2* and existing methods on Model 
2 and 3. 



For comparison with L-W estimator, we use the same criterion PRIAL which 
was defined in formula (6.2) of Ledoit and Wolf (2012) and repeated simulations for 
Model 1. Empirical PRIAL is reported in Figure [3] Compared with Figure 6 of 



Ledoit and Woln (120 12l) . one can see the proposed estimator converges quicker than 



L-W estimator to the optimal PRIAL 1. What's more, even for small sample dimen- 
sions such as p = 30, n = 90, the PRIAL of £2* is 97% while the one for L-W estimator 
is 91%. Before a rriving at a general concl usion, it is necessary to compare our estima- 
tor to the one of Ledoit and Wolf d2012h in some other simulations. As there is no 
open code for L-W estimator, we cannot make any further comparison. From current 
simulations, we say, even if p < n, the new estimato r still has a better or comparable 
performance compared to the non-linear estimator of iLedoit and Wolf ( 2012 ). 

4.2. Influence of n and p 

In the last subsection, we have fixed n = 3p. Here we will investigate the influence 
of the sample size n for fixed sample dimension p or the one of p for fixed n. Since 
these two scenarios are equivalent, we will only fix the sample size n = 100 and let p 
range from 10 to 200. As the above estimators for £2 p are only applicable for p <n, the 
F-S estimator on estimating L p will be included as a comparison. Results for Model 2 
are reported in Figure |4] 

From Figure |4j we can see that for the fixed sample size n — 100, when p was 
increased, the risk increases. This could be understood as when p increases, more 
parameters need to be estimated and the corresponding loss will be higher. Over all, the 
proposed estimator behaves better than F-S estimator and all the risks of the proposed 
estimator are small, no greater than 0.4. 

4.3. Additional simulations 

For a good estimator £2 for E p 1 , intuitively, £p£2E]/ 2 should behave like an iden- 
tity matrix I p . To better illustrate the performance of different estimators, Figure [5] 

1/2*1/2 * 

shows heat-maps of E p £2E ; y for each estimator £2. All of the heat-maps suggest that 
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Figure 3: Empirical PRIAL of the proposed estimator £1* and existing methods for 
p/n= 1/3. 
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Figure 4: Empirical influence of the data dimension p about the proposed estimator £1* 
and existing methods for fixed sample size n — 100. 
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Standard Estimator 




Proposed Estimator 




Figure 5: Heat-maps of different estimators where p = 50, n = 100 and £ = £3. White 
represents 0, and black represents 1 . 

the proposed estimator performs better than existing methods to estimate the precision 
matrix. 

4.2 Analysis of Real Dataset 

In this section, we apply the proposed shrinkage estimator to Leukemia data. The 
Leukemia data set contains p = 7129 genes for 47 acute lymphoblastic leu kemia (ALL) 



and 25 acute myeloid leukemia (AML). More details can be found in iGolub et al. 



1999b and the data is available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. 



For a fair comparison wit h other methods of estimating the inverse covariance matrix, 
we follow similar steps as lFan and F an (2008) and select the top p — 20,40,60,80, 100 
genes with the largest two sample f-statistics score f or further study. For other classifi 



cation methods o n Leuk emia data, one is re ferred to Dudoit et al. (2002), Fan and Fan 
d2008h . ICai et all d201ll) or our recent work lWang et al.\ J2012h 

Specifically, we will consider four classification distances as follows: 

• LDA: di = (xq —Xi)'S (xo — jc/), i — 1,2; 

• DLDA: di = (xq — Xij'D^ (jrrj — xi), i = 1,2; 

• LQD: dj = (jco — jcj)'£j" (xq —Xi) + log (det (£/)), i = 1,2, where £r will be 
estimated by our proposed method or F-S method. 
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Table 1: The average misclassification rates (%) for classic LDA and LQD using the 
proposed £1* and others methods on Leukemia data. Standard deviations are given in 
parentheses. 





p = 20 


p = 40 


p = 60 


p = 80 


p = 100 


LDA 


8.29(4.94) 


NA 


NA 


NA 


NA 


DLDA 


6.75(3.56) 


4.86(2.85) 


4.76(2.73) 


4.79(2.85) 


4.59(2.62) 


LQD(F-S) 


6.67(5.00) 


4.58(3.27) 


4.63(3.18) 


3.85(3.18) 


3.63(2.93) 


LQD(Proposed) 


5.60(2.83) 


4.48(3.23) 


4.36(3.25) 


3.78(2.89) 


3.42(2.38) 



Here xq is the testing sample, i, is the average values of within the training set i, S is the 
pooled sample co variance matrix and D$ = diag(S). For each xo and distance function 
di, if d\ < d2, then xq is classified into group 1, otherwise into group 2. To evaluate 
the performances of the proposed shrinkage estimator and the existing methods, we 
specifically set approximately 1 /2 of the observations that is 23 ALL and 12 AML as 
the training set, and the others will serve as the test set. 

Table Q] reports the average misclassification rates based on 10 3 bootstrap replica- 
tions on the top p = 20, 40, 60, 80, 100 genes. Similar to the above simulation studies, 
it is evident again that the proposed shrinkage estimator i2* outperforms the existing 
methods. 



4.3 Conclusions 

A new data-driven shrinkage estimator for the population precision matrix has been 
introduced using RMT . The new estimation is non-parametric without assuming a 
specific parameter distribution for the data and also there is no prior information about 
the structure of the population covariance matrix. Compared with existing methods 
which are only applicable for p < n, the new estimator d ominates the standard S~ 
and (1 — p/n)S~ l proposed bv lMestre and Lagunasl d2006l) and pe rforms comparably 
with the non-linear shrinkage estimator in iLedoit and Woln (I2012I) . What is more, we 
demonstrate by both theoretical and empirical studies that the new estimator, which is 
applicable for p > n, has good properties for a wide range of dimensions and sample 
sizes. 



Appendix: Proofs 
4.4 Proofs of Theorem [1] 

By Corollary A.41 of iBai and Silverstein (2010), we have 



L (FT " ,F7> ) < ( ) 2 A 2 -fr(E- 2 ) 

P y P 



On the other hand, by A. 2, we can get 

1 

P 



tr(Z p 2 ) < C. 
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Noting ^ — > y, we have L (F p '' ,F >' '' ) — > 0. Therefore, the Stieltjes transform of 



the LSD of f"i' Z p 1 is 



m H {z) 



-dH(t) 



where z G C + . Then by Bai and Silversteinl ( 20ld chap. 4) or the main theorem inlPa: 



Stieltjes transform m\ (z) satisfies 



"H — YY 1 + — £~ 

(2010), as n — > °°, F p p converges a.s. to a non-random distribution F\ , whose 



mi(z) = y 



ty z + >•(!+«! (z)) 



(13) 



vf 1 yy^ i 

It is a simple matter to verify that F n P p ' converges a.s. to a non-random 
distribution F%, whose Stieltjes tra nsform is m-iiz) = 

Similarly,by Corollary A.41 of Bai and Silverstein d2010h . we can prove y(^YY r + 

1 ) and £ YY T + XY,~ 1 ha ve the same LSDs. Here w e also use the fact that the sup- 
port of F\ or F2 i s bounded by Bai and Silverstein ( 1998b or the extreme eigenvalues of 
S n is bounded by IPan and Zhoul (1201 II) . 

Above all, we have proven, as n — > 00, f jfY^+AE converges a.s. to a non-random 
distribution F2, whose Stieltjes transform m.2 (z) satisfies 

1 



ni2{z) 



t l+ym 2 (z) 



dH{t). 



Finally, by Theorem A.43 of iBai and Silverstein (2010), we have 



\ F Y.„ lll (S„+Xl p )Z p 112 _p\YY T +XZ- i I 



-1/2 

P 

1 — — T 1 

-rank(YY T ) < -, 
P P 



< -jank(£ p L '\S n + Up)Z p 1/2 - (^YY r + AE p 1 )) 



where ||/|| = sup Y |/(jc)|. The proof is completed. 



(14) 



4.5 Proofs of Theorem H 

Firstly, we need the following lemma. 
Lemma 2 Under the conditions of Theorem^ almost surely, 

-tr{l}J 2 {S n +kI p )- l l}J 2 )^Rr{Xl 

i?r(4 /2 (S n + kip)-' E ; V 2 ) 2 -+Ri(k) 
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where R\{X) and R2 (A ) satisfy 



R] a) l j 1 dH(t) 

t + 1+yR^l) 
, yR 2 (X) 



J yj+ 1+yR^)' 

Proofs: By the definition of ESD and Helly-Bray theorem, 

p J X 



p 

1 



a.s 



—dF(x) — limm(z), 

x z^O 



-tr(4 /2 (S„ + Xl p )- l 4 /2 ) 2 ^ f \dF^ lll ^ + ^'\x) 



U.S. 



-^rdF(x) = limm'(z), 

xr z^O 



that is 



Ri(X) = f-dF(x) = limw(z), 

J x z^O 

R 2 (X) = [ ^rdF(x) = limm'(z). 



By Equation [6] we have 

^ , ym'(z) 
m'( Z ) = J J^f dH( t ). 

yj- z + \ +ym ( Z) ) 

For both sides of Equation (O and (1171 1. letting z — > 0, we can get 

*i(A)= / x 1 j rftf(r), 

< + l+vJ?i(A) 



(f+ 1+ 3 4(A)) 2 



The proof of Lemma|2]is completed. 
By LemmaQ] we have 



mo(-X)=J T 



dH(t) 



f(l -y+yXmo{-X))+k 
In Lemma 12 writing 

l[X) -X {1 - 1+yR^Xy 
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that is 

l-Av(A) 



«i(A) 



l-;y(l-Av(A))' 
then 

l-Av(A) 
1 -y(l -Av(A)) 

Further calculations can show 

V(A) : 



1+ ^l-3>(l-Ai<(A)) 



-dH(t). 



dH(t) 



r(l -y+;yAv(A))+A' 
which is the same as Euqation ([S]). In addition, 

1 - y +y Av(A) = 1 -v +y (l - T Ag)_ ) = —L^ > o. 

Hence, v(A) = mo (—A) that is 

1 — Amo(— A) 



*i(A) = 



-y(l - Amo(-A)) ' 

Further, 

</i?i(A) m (-A)-Am (-A) 



rfA (l-y(l-Am (-A))) 2 ' 
By formula ([TBI . 

j_ _ j-R'i(A) 

*UA) = -/-y™^M, 

7 ^7+ i+>'fi 1 (A)i 

then 

1 , (l+>^i(A)) 2 ( J R 1 (A) + A/?' 1 (A)) 



+ -T- 



(7 + TTwbry) 2 l+yRiW+yUliCA) 
By ( fT6l ), we have 

fl 2 (A) - ^ + < + .-W 



l+vR,(A) ' 

= (l+y J R 1 (A)) 2 (^ 1 (A)+A/?' 1 (A)). 

Together with ( TT~8b and d!91 l. we can show 

„ 1 - Amp(-A) Amp(-A) ~A 2 m (-A) 

21 ' ~ (1 -y(l -Amo(-A))) 3 (1 -y(l -Amo(-A))) 4 ' 

The proof of Theorem[2]is completed. 
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4.6 Proofs of Lemma Q] 

Firstly, mo(— A) is the solution of Equation (0. Added to this, almost surely, 

-fr(y5„ +/ P ) _1 -> Am (-A). 
/? A 

Then, Amo(— A) > min(0, 1 — ^). Hence 1 — y+yAmo(— A) > 0. 

Next, suppose we have two solutons M\ , M 2 of Equation ^ and 1 — y+yXMj > 
0, j = 1,2. Then 



dH(t) 
{l-y+yXM^ + V 



dH(t) 



M 2 



Hence, 

M\—Mj_ = (M 2 - Mi 
If Mi ^ Mz, we have 

-1 = 



f(l -y+yAM 2 ) + A' 
ytdH{t) 



(t(l-y+ylM 1 ) + l)(t{l-y+ylM 2 )+k) 
ytkdH(t) 



(r(l -y+yAAfi)+A)(f(l -y + yAM 2 ) +A) 

yrA 

(r ( 1 -y+yMi )+X)(f(l -y+>AM 2 )+A) 



which is in contradiction with fff1 _ v ^ va-Af , ^^ ff1 _ v v/ , 0. Therefore. Equalion 
([8]) has a unique solution. 



4.7 Proofs of Theorem g] 

By Lemma [2] 

«i(P) = / g 1 ; <*»(*) 

< + l+vfli((3) 
1 1 y« 2 Q8) 

Rm= [ p + T^ dH(t) . 

(t+ i+vfl!(/5)) 

Moreover, 1+y ^^^ = 1 — y(l — j3mo(— J3)) where mo(— z) is the Stieltjes transform 
of LSD of S„. Denoting the LSD of S„ as (x), then 

mo(-p) = f^dF^(x). (20) 

Further, if we define another distribution function as 

F« = (l-y)/ (0H W+yF(°)W, (21) 
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and 

mi (-/?) = j-L.dF^\x), (22) 
we have 1 - j3m (-/3)) = j3mi(-/3). Writing 7 = y(/3) = l/mi(-j3), then 

*i(0) = l / rrr^W, 



j37 f+y 



Therefore, we have 



i-yf(^- r ) 2 dH(ty 



= l-([—dH(t)) 2 { ^ y) 

= Mr)- 

About the y(j3), we have 

1 



YiP) 



which is a strictly increasing function on j3 . Therefore, 7 and j3 are one-to-one corre- 
sponding. Specially, when y < 1 that is F^\x) has a point mass 1— y at the origin, the 
function y(/3) : (0,<») 1 — ► (0,°°). Wheny > 1, the function y(j3) : (0,°°) 1 — > (70, °°) 
where y Q J 1 /xdF^\x) = 1. Above all, we have 

minL(j3) ~mmL H (y), y < 1, 

p>o y>o 

minL(S) = mmL H (v) 7 y > 1. 

/3>o y>ro 

When H(x) is a degenerate distribution at a 2 , 



a 



2 



L H (y)=y{-^—) 2 . 
o A + y 

Obviously, L H (y) achieves its minimum value Lq = at 7* = 00. So, j3* — > °° and 

1 _W)_ Km ^ =0| 



a* /?i(j3" 



a 



4 

)3* 1 (J+yY 

— = hm -- — ^— = a , 

y— >oo y Of 
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which means the theoretical optimal estimator is Q.* = (J I t 
For general distribution H(x), denoting fk{x) = J(j^) k o 

dL H (x) fi(x) 



dx ~ {f 2 (x)f 
2/i (x) 



(Mx)ti(x)-2(1 -yf 2 (x))f{(x)f 2 (x)) 
T (Mx)(f 3 (x) -f 2 {x)) - (1 -yf2(x))(f 2 (x)-Mx))f 2 (x)) 



2Mx)(f 2 (x)-Mx)) Mx)Mx)-f 2 (x)Mx) 
[y Mx)f 2 (x)(fi(x)-f 2 (x)) ) 

k 

where we used the facts that f k {x) = —kj dH{t) and xf k {x) = k(fk + i(x) — 

AW). 

Writing g(x) = /^/{(Ift/^j-^^)) ' " is eas y t0 show lim ^o+ g(x)=0 and lim x ^ +co g( 
+00. Therefore, Lh(j) can achieve its global minimum value at 7* which satisfies 

h{t)h{t)- fi{f) fiit) 



/ 2 (r)/ 2 (r)(/i(r)-/2(r)) 



Therefore, by the definition of y(/3 ), when y < 1 , j3 * satisfies the equation 7* = i_- y (i_^ mo (_ | fl 
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